(54) System for identifying materials by NIR spectrometry



(57) In a method for identifying an unknown product, absorbance spectra of known products are measured and stored in a library. A quick search using clustering techniques is conducted to narrow the search to a few products, followed by an exhaustive search of the spectra of the few products. More specifically, principal component analysis is applied to the absorbance spectra to generate product score vectors extending into principal component inside model space, which are divided into clusters and subclusters in accordance with their relative proximity. Hyperspheres are constructed around each vector and an envelope is constructed to enclose each cluster, surrounding the hyperspheres within the cluster. The absorbance spectrum of the unknown product to be identified is measured and an unknown product score vector is determined from the unknown product spectrum, projecting in principal component inside model space of the clusters. It is determined whether or not the unknown product score vector falls within one of the envelopes and, if so, the product score vector is projected into the principal component inside model space of that cluster and it is determined whether or not the unknown product score vector falls within any of the subclusters divided from the cluster. This process is repeated until the unknown product score vector is found to lie in a cluster which is not further subdivided. In this manner, the search is narrowed to a few products. An exhaustive search is then carried out to match the spectrum of the unknown product with the spectra of the known products corresponding to the undivided subcluster.
Description

Background of the invention 

This invention relates to a system for identifying materials and, more particularly, to a system making use of infrared
analysis to match an unknown material with one of a large number of known materials and in this manner identify the 
unknown material as being the same as the material with which it matches. 

Industrial concerns need to be able to qualitatively analyze a material in order to identify it. For example, when a purchased material has been received, it will normally be identified by labeling and by shipping documents, but these indications are sometimes in error or are missing. With a system that quickly identifies a material when it is received, the material can be identified even if it is mislabeled or its identification is missing, before the purchasing company accepts the material and becomes financially responsible for paying for it.

Prior to the present invention, infrared analysis had been used to qualitatively identify known materials. One such system making use of infrared analysis is described in U.S. Patent No. 4,766,551 to Timothy H. Begley issued August 23, 1988. In the system of the Begley patent, the near infrared (NIR) spectra of a large number of known products are measured by detecting the absorbance of each known product at incremental wavelengths distributed throughout the NIR spectrum. The measurements at the incremental wavelengths, making up l measurements, are each considered to be an orthogonal component of a vector extending in l-dimensional space. The spectrum of the unknown material is also measured and is represented by a vector extending in l-dimensional space. The angle between the vector of the unknown product and the vector of each of the known products is calculated and, if the angle between the known product and the unknown product is less than a predetermined minimum, the unknown product is considered to be the same as the known product.
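The angle comparison described above can be sketched in a few lines. This is only an illustration of the general idea, not the Begley implementation: the function names, the toy three-point spectra and the threshold value are all assumptions made for the example.

```python
import math


def angle_between(a, b):
    """Angle in radians between two spectra treated as vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    # Clamp to [-1, 1] to guard against floating-point overshoot.
    cos_theta = max(-1.0, min(1.0, dot / (norm_a * norm_b)))
    return math.acos(cos_theta)


def match_by_angle(unknown, library, max_angle):
    """Names of library products whose angle to `unknown` is below max_angle."""
    return [name for name, spectrum in library.items()
            if angle_between(unknown, spectrum) < max_angle]


# Hypothetical library with three-point "spectra"; real spectra have hundreds of points.
library = {"A": [1.0, 0.0, 0.0], "B": [0.0, 1.0, 0.0]}
print(match_by_angle([0.99, 0.05, 0.0], library, max_angle=0.1))
```

Every unknown is compared with every library entry, so the cost grows with the library size times the number of wavelength points, which is the time burden noted in the next paragraph.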

The above-described system is reasonably accurate in identifying unknown materials, but it takes a substantial amount of time to complete the vector analysis needed to compare the unknown product with each known material. Accordingly, this system is not suitable for making a rapid identification of an unknown material such as might be needed on a loading dock.

Summary of the Invention 

The present invention may be considered an improvement on the system described in the Begley patent and makes it possible to identify unknown materials very quickly.

In accordance with the invention, a series of absorbance spectra are measured for a large number of known products. A set of spectra for each known product is determined from different batches of the same product having minor variations, such as might occur from being manufactured by a different process or from having permissible levels of impurities within the product. Each set of spectra comprises absorbance values measured at wavelengths distributed over the near infrared range from 1100 nanometers to 2500 nanometers. The set of spectra for each known product is subjected to principal component analysis to condense the data representing each product to the order of 10 to 20 values or, more specifically, dimensions. The principal component analysis data compression will result in a set of data which retains about 90 percent or more of the information in the original set of spectra for a given product. The values of the condensed data for a given product are considered to represent a vector extending in multidimensional space or hyperspace, which is referred to as a product score vector. Vectors representing the known products are divided into clusters, each consisting of vectors close to one another in the hyperspace in which the vectors extend. The vectors all extend from a common origin in hyperspace and the phrase "position of the vector" refers to the position of the end point of the vector in hyperspace. Accordingly, the closeness of the vectors refers to the closeness of the end points of the vectors. A hypersphere, which is a multi-dimensional sphere, is constructed around the position in hyperspace represented by each known product score vector. The radius of the hypersphere will be a scalar quantity times the Euclidean norm determined from the standard deviation spectrum for the product based on the original set of spectral data obtained for the product. By making the scalar multiplier 3, the hypersphere will encompass the positions of 99 percent of the product vectors of all products which are the same as the known product represented by the product vector at the center of the hypersphere. Following the construction of the hyperspheres around each point in space represented by each known product vector, envelopes are constructed around each group of products which are in the same cluster. These envelopes are preferably defined by rectangular coordinates in hyperdimensional space and accordingly correspond to a rectangular parallelepiped in three-dimensional space, and may be appropriately termed a hyper-rectangular-parallelepiped. For convenience this structure will be referred to as a hyper-box. Each hyper-box will be defined by a minimum and maximum dimension in each of the n dimensions of the hyperspace.

The clustering technique initially divides the vectors representing the known products into clusters having different numbers of vectors in a given cluster. For example, some clusters may have as many as 50 vectors contained therein and other clusters may have only 9 or 10 vectors or fewer. When the number of vectors within a cluster exceeds a selected number, for example, between 10 and 20, the vectors within that cluster are divided into child clusters or subclusters by applying principal component analysis to the spectra of the cluster. The child clusters are further divided into grandchild clusters or subclusters to define a hierarchical tree of clusters. The division process is carried out until the number of vectors in a given cluster at the most divided level does not exceed the selected number, preferably between 10 and 20. Each of the clusters and subclusters is surrounded by an envelope in the form of a hyper-box. The dimensions of each hyper-box are selected so that it encompasses each hypersphere of each known product vector within the cluster.
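As an illustration of how such a hyper-box might be derived, the sketch below takes the per-dimension extremes of every hypersphere in a cluster. The function names and the two-dimensional toy data are assumptions for the example; the source only states that the box must enclose each hypersphere.

```python
def hyper_box(centers, radii):
    """Smallest axis-aligned box enclosing every hypersphere of a cluster.

    centers: list of k-dimensional score vectors
    radii:   one hypersphere radius per center
    Returns (mins, maxs), each a list of k bounds.
    """
    k = len(centers[0])
    mins = [min(c[d] - r for c, r in zip(centers, radii)) for d in range(k)]
    maxs = [max(c[d] + r for c, r in zip(centers, radii)) for d in range(k)]
    return mins, maxs


def inside_box(point, box):
    """True when `point` lies within the box bounds in every dimension."""
    mins, maxs = box
    return all(lo <= p <= hi for p, lo, hi in zip(point, mins, maxs))


# Two hypothetical products in a two-dimensional score space.
box = hyper_box([[0.0, 0.0], [4.0, 1.0]], [1.0, 0.5])
print(box)
print(inside_box([2.0, 0.0], box), inside_box([5.0, 0.0], box))
```

Testing membership costs only two comparisons per dimension, which is what makes the box a cheap first filter before any spectrum-by-spectrum comparison.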

When an unknown product is received to be identified, the near infrared spectrum of the product is measured, yielding absorbance values extending throughout the near infrared range, for example, from 1100 nanometers to 2500 nanometers. This spectral data is processed so as to construct a score vector representing the product, which vector extends in the same space as, and corresponds to, the vectors representing known products. In accordance with the invention, it is first determined whether or not the score vector for the unknown product falls within the envelope of any of the highest order clusters in the hierarchical tree. If the vector of the unknown product falls in a cluster which has subclusters, a score vector of the unknown product is then compared with each of these subclusters at the next level in the hierarchical tree to determine if the vector of the unknown product falls within the envelope surrounding one of the subclusters. If the vector for the unknown product is then determined to fall into a subcluster which is further divided into subclusters, the process is repeated at the next level down in the hierarchical tree until a score vector of the unknown product is determined to fall into a cluster or subcluster at the lowest level in the hierarchical tree. In this manner, the vector will have been determined to correspond to one of the products in the cluster or subcluster at this lowest level of the hierarchical tree. Since the subcluster or cluster will have at most 10 to 20 products, the number of products to which the unknown product may correspond will have been reduced to 10 to 20 products or fewer. As indicated above, the number of products in the subcluster may be as few as 2 or may be as many as the selected maximum number between 10 and 20. Following the determination of the final subcluster in which the product falls, an exhaustive comparison is made between the spectrum of the unknown product and the spectra of each known product in the final subcluster to determine to which product the unknown product corresponds.
At any point during the process, if it is determined that the vector of the unknown product does not fall within any cluster or, finally, does not correspond to any product in the final subcluster, the unknown product is considered to be what is known as an outlier and is determined not to correspond to any of the known products.
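The descent through the tree, including the outlier exit, might look like the following sketch. The `Cluster` class, its field names and the identity projection are hypothetical stand-ins; in the described system each cluster would carry its own principal component projection, and the final exhaustive comparison is not shown.

```python
class Cluster:
    """One node of the cluster tree (a hypothetical structure for illustration)."""

    def __init__(self, box=None, products=None, children=None, project=None):
        self.box = box                  # (mins, maxs) in the parent's score space
        self.products = products or []  # filled only for undivided clusters
        self.children = children or []  # subclusters, empty at the lowest level
        self.project = project          # spectrum -> score vector in this node's space


def in_box(point, box):
    mins, maxs = box
    return all(lo <= p <= hi for p, lo, hi in zip(point, mins, maxs))


def quick_search(spectrum, root):
    """Walk down the tree; return candidate products, or None for an outlier."""
    node = root
    while node.children:
        score = node.project(spectrum)
        hits = [child for child in node.children if in_box(score, child.box)]
        if not hits:
            return None   # falls in no envelope at this level: outlier
        node = hits[0]    # assumption: envelopes at one level do not overlap
    return node.products


def identity(spectrum):
    # Stand-in for a per-cluster principal component projection.
    return spectrum


grandchild = Cluster(box=([0.0, 0.0], [0.5, 0.5]), products=["P1", "P2"])
child = Cluster(box=([0.0, 0.0], [1.0, 1.0]), children=[grandchild], project=identity)
other = Cluster(box=([2.0, 2.0], [3.0, 3.0]), products=["P3"])
root = Cluster(children=[child, other], project=identity)

print(quick_search([0.2, 0.3], root))  # reaches the undivided grandchild cluster
print(quick_search([0.8, 0.8], root))  # inside the child box but no grandchild box
```

The source does not say what happens if a score vector falls inside two envelopes at the same level; taking the first hit is a simplification made here.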

The clustering technique described above permits a quick search of the spectra in the library of known products to narrow the number of known products to match the unknown product to a few candidate products in the library. After this quick search by the above-identified process, an exhaustive comparison is made of the library spectra representing just those candidate known products to which the clustering technique narrowed the search, to positively identify the unknown product as matching one of the known products. This exhaustive comparison involves a relatively time consuming comparison of the spectral data of the unknown product with sets of spectral data of the known candidate products to which the search has been narrowed by the clustering technique. Because this comparison is with just a few candidate products, the entire identification process is reduced to a small fraction of the time required by the prior art processes to identify an unknown product. Yet, because an exhaustive comparison is made of the unknown product with the spectra of the candidate products identified by the clustering technique, the accuracy of the identifying process is very high and is equal to that of the prior art processes.

Brief Description of the Drawings

Fig. 1 is a block diagram illustrating the apparatus of the system of the invention; 

Fig. 2 is a block diagram of the process representing the known products by vectors and dividing the vectors into a hierarchical tree of clusters and subclusters;

Fig. 3 illustrates an example of how known products might be divided into clusters by the process of the invention;

Fig. 4 is a schematic illustration of the process of projecting a product mean and standard deviation spectra into principal component inside model space as a score vector and surrounding the score vector with a hypersphere;

Fig. 5 is a flow chart representing the process of the system of the invention of comparing an unknown product represented by a vector with the clustered vectors of the known products to determine an identification of the unknown product.

Description of a Preferred Embodiment

The apparatus employed in the system of the present invention comprises a near infrared spectrometer 11 having an oscillating grating 13 on which the spectrometer directs light. The grating 13 reflects light with a narrow wavelength band through exit slit optics 15 to a sample 17. As the grating oscillates, the center wavelength of the light that irradiates the sample is swept through the near infrared spectrum. Light from the diffraction grating that is reflected by the sample is detected by infrared photodetectors 19. The photodetectors generate a signal that is transmitted to an analog-to-digital converter 22 by amplifier 20. An indexing system 23 generates pulses as the grating 13 oscillates and applies these pulses to a computer 21 and to the analog-to-digital converter 22. In response to the pulses from the indexing system 23, the analog-to-digital converter converts successive samples of the output signal of the amplifier 20 to digital values. Each digital value thus corresponds to the reflectivity of the sample at a specific wavelength in the near infrared range. The computer 21 monitors the angular position of the grating 13, and accordingly monitors the wavelength irradiating the sample as the grating oscillates, by counting the pulses produced by the indexing system 23. The pulses produced by the indexing system 23 define incremental index points at which values of the output signal of the amplifier are converted to digital values. The index points are distributed incrementally throughout the near infrared spectrum and each corresponds to a different wavelength at which the sample is irradiated. The computer 21 converts each reflectivity value to an absorbance of the material at the corresponding wavelength. The structure and operation of a suitable spectrometer is described in greater detail in U.S. Patent No. 4,969,739.
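The text does not spell out how a reflectivity value is converted to an absorbance. A common convention in NIR reflectance work is A = log10(1/R), and the sketch below assumes that convention; the function name and the sample values are illustrative only.

```python
import math


def absorbance(reflectance):
    """Apparent absorbance from fractional reflectance, assuming A = log10(1/R)."""
    if reflectance <= 0.0:
        raise ValueError("reflectance must be positive")
    return math.log10(1.0 / reflectance)


# Hypothetical reflectivity readings at three index points.
spectrum = [absorbance(r) for r in (1.0, 0.5, 0.1)]
print(spectrum)
```

Under this convention a perfectly reflecting sample has zero absorbance, and each tenfold drop in reflected light adds one absorbance unit.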

In accordance with the present invention, the instrument shown in Figure 1 is used to measure the absorbance spectra from a large number of known products and to store a library of these spectra in its disc memory. The products from which the library of spectra is obtained are selected to be those which will be likely to correspond to an unknown product to be identified by the system of the present invention. The library of spectra is subjected to principal component analysis using singular value decomposition. The singular value decomposition algorithm is used to determine principal component model space in order to reduce the number of values representing the spectrum of each product in the library of products.

In accordance with the invention, several different samples of each product for the library are obtained and the different samples of a given product are selected to have minor variations from each other such as would occur from being provided by different manufacturers or being produced by different manufacturing processes.

The system of the invention shown in Fig. 1 is used to measure and obtain an absorbance spectrum of each sam- 
ple of each product thus providing a plurality of spectra for each product. Each spectrum is measured at the same incre- 
mental wavelengths. 

The several spectra for the different samples of a product in the library are called a training set. If there are n samples of a given product so that there are n spectra in a training set for a product, then the spectrum of one sample in the training set can be represented as follows:

$x_i = [x_{i1}, x_{i2}, \ldots, x_{il}]^{T}$    (1)

in which $x_i$ is a column vector made up of reflectance measurements $x_{i1}$ through $x_{il}$ taken from the sample i at each of the incremental wavelength points 1 through l. In accordance with the invention, the computer determines a mean spectrum from each product training set by averaging the reflectivity values of the training set at each wavelength to thus determine a mean spectrum for the product, which can be represented as follows:

$x^{m} = \frac{1}{n}(x_1 + x_2 + \ldots + x_n)$    (2)

In the above equation $x^{m}$ is a column vector of the mean spectrum values and $x_1$ to $x_n$ are each column vectors representing the spectrum of each of the samples 1 through n.
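Equation (2) is a point-by-point average over the training set. A minimal sketch, with a made-up two-sample, three-wavelength training set and a hypothetical function name, is:

```python
def mean_spectrum(training_set):
    """Average the n training spectra wavelength by wavelength (Equation 2)."""
    n = len(training_set)
    # zip(*training_set) groups the values measured at the same wavelength point.
    return [sum(values) / n for values in zip(*training_set)]


# Two hypothetical samples of one product, three wavelength points each.
training_set = [[0.25, 0.5, 1.0], [0.75, 0.5, 2.0]]
print(mean_spectrum(training_set))
```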

As described in the above-mentioned Begley patent, each known product may be represented by a multidimensional vector extending in hyperspace wherein each dimension of the vector is defined as a different one of the absorbance measurements distributed throughout the spectrum. Typically, the near infrared spectrum is measured by 700 incremental measurements distributed throughout the spectrum so that in the system of the Begley patent, each known product will be represented by a 700 dimensional vector.

In the system of the present invention, to reduce the computation required and thereby reduce the amount of time needed to identify the products, the spectra of the products are subjected to principal component analysis. This process reduces the number of numerical values representing each product so that each product may be represented by a vector extending in hyperspace of substantially fewer dimensions, e.g. 10 to 20 dimensions. To carry out the principal component analysis, first a global principal component model space is determined from the mean spectra representing the library of known products. As a first step of this process, the column vectors of the mean spectra are considered to form a matrix X as follows:

$X = (x^{m}_{1}, x^{m}_{2}, \ldots, x^{m}_{n})$    (3)

In this expression $x^{m}_{1}$ through $x^{m}_{n}$ each represents a column vector of a mean spectrum for a different one of the products in the product library of known products 1 through n. This expression for the mean spectral matrix of the product library is simplified by dropping the superscript as follows:

$X = (x_1, x_2, \ldots, x_n)$    (4)

From the spectra $x_1$ through $x_n$ a mean column vector $\bar{x}$ is determined by averaging the values in the spectral matrix X at each wavelength. Since X is a spectral matrix of mean vectors, the mean vector $\bar{x}$ will actually be a mean vector of a set of mean vectors, which is determined from the vectors in the matrix X of Equation 3 as follows:

$\bar{x} = \frac{1}{n}(x_1 + x_2 + \ldots + x_n)$    (5)

wherein $x_1$ through $x_n$ are mean vectors representing the different products in the spectral library. The vector $\bar{x}$ thus represents an average absorbance at each wavelength from these mean vectors representing the known products. From the global mean vector $\bar{x}$, a mean centered matrix $\bar{X}$ is calculated as follows:

$\bar{X} = (x_1 - \bar{x}, x_2 - \bar{x}, \ldots, x_n - \bar{x})$    (6)



Each of the expressions $(x_1 - \bar{x})$ through $(x_n - \bar{x})$ is a column vector and each column vector is determined by subtracting the global mean vector $\bar{x}$ from each of the product mean vectors $x_1$ through $x_n$. Singular value decomposition is then applied to the mean centered matrix $\bar{X}$ to obtain

$\bar{X} = U W V'$    (7)

in which U is an l by n matrix of orthonormal vectors, V is an n by n matrix of orthonormal vectors and W is an n by n diagonal matrix. The diagonal elements of W are

$w_1 \geq w_2 \geq \ldots \geq w_n \geq 0$    (8)

and are singular values of $\bar{X}$. The values $w_1$ through $w_n$ are defined as the square roots of the eigenvalues of the covariance matrix $\bar{X}'\bar{X}$, wherein $\bar{X}'$ is the transpose of $\bar{X}$. The eigenvalues of the covariance matrix $\bar{X}'\bar{X}$ corresponding to the squares of $w_1$ through $w_n$ are represented as $\lambda_1$ through $\lambda_n$. The principal components of the mean centered spectral data matrix $\bar{X}$ are the eigenvectors of the matrix $\bar{X}\bar{X}'$, which has the same nonzero eigenvalues as $\bar{X}'\bar{X}$, that are associated with the nonzero eigenvalues. The column vectors of U in Expression 7 which are associated with nonzero eigenvalues are these eigenvectors and are the principal components for the mean centered spectral matrix $\bar{X}$. Since there are n different mean centered product spectra making up the column vectors of $\bar{X}$, there are n column vectors in the mean centered matrix $\bar{X}$ and there exist n-1 nonzero singular values of $\bar{X}$. Accordingly, there are n-1 principal components of the matrix $\bar{X}$. Expression 7 can be rewritten in the following standard form in principal component analysis:

$\bar{X} = L S$    (9)

in which L is an l by n-1 matrix expressed as follows:

$L = [u_1, u_2, \ldots, u_{n-1}]$    (10)

In the matrix L, $u_1$ through $u_{n-1}$ are column vectors and comprise the principal components of the mean centered matrix $\bar{X}$. The matrix L is referred to as the loading matrix.
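The count of n-1 principal components follows from the mean centering itself. A short derivation, using the symbols of Equations (5) and (6) and assuming there are at least n-1 wavelength points and that the product mean spectra are otherwise linearly independent, is:

```latex
\sum_{j=1}^{n} (x_j - \bar{x})
  = \sum_{j=1}^{n} x_j - n\,\bar{x}
  = n\,\bar{x} - n\,\bar{x}
  = 0
\quad\Longrightarrow\quad
\operatorname{rank}(\bar{X}) \le n - 1 .
```

The n columns of the mean centered matrix sum to the zero vector, so they are linearly dependent and at most n-1 singular values can be nonzero; under the stated assumptions exactly n-1 are.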

In Equation (9) S is an n-1 by n matrix called the score matrix and is represented as follows: 

$S = \mathrm{diag}(w_1, w_2, \ldots, w_{n-1})\,[v_1, v_2, \ldots, v_{n-1}]'$    (11)



The principal component vectors of the matrix L span a multidimensional space or hyperspace called principal component model space of the mean centered spectral matrix $\bar{X}$. While, as represented in the matrix L, the principal components (vectors) are normalized, the vector lengths are of no significance. It is the direction of the principal components in representing the coordinates of the principal component model space that is significant. Each of the column vectors in the mean centered spectral matrix $\bar{X}$ can be projected in principal component model space and represented as a linear combination of coordinate values.

The magnitudes of the eigenvalues $\lambda_1$ through $\lambda_{n-1}$ are proportional to the amount of variance in the mean centered spectral data matrix $\bar{X}$ which is modeled by the corresponding principal components represented by the column vectors $u_1$ through $u_{n-1}$ of the matrix L. The principal components associated with the largest eigenvalues model the largest fraction of the total variance in $\bar{X}$. It is possible to model most of the variance in the spectral matrix $\bar{X}$ by substantially fewer than n-1 principal components. The set of nonzero eigenvalues can be partitioned into two sets, a primary set $\lambda_1$ through $\lambda_k$ and a secondary set $\lambda_{k+1}$ through $\lambda_{n-1}$. The primary set includes all values substantially greater than zero so that the secondary set has minimum values substantially equal to zero. The primary set of eigenvalues will cover a large percentage of the cumulative variance of the mean centered spectral matrix $\bar{X}$. In the same way the loading matrix L can be partitioned into $L_I$ and $L_O$, wherein $L_I$ is a matrix made up of the column vectors $u_1$ through $u_k$, called the primary principal components, and $L_O$ is a matrix made up of the column vectors $u_{k+1}$ through $u_{n-1}$, which are referred to as the secondary principal components. Equation (9) can be rewritten as:

$\bar{X} = L_I S_I + L_O S_O$    (12)

In this equation, $S_I$, which is a k by n matrix, can be expressed as follows:

$S_I = \mathrm{diag}(w_1, w_2, \ldots, w_k)\,[v_1, v_2, \ldots, v_k]'$    (13)

and $S_O$, which is an n-k-1 by n matrix, can be represented as follows:

$S_O = \mathrm{diag}(w_{k+1}, w_{k+2}, \ldots, w_{n-1})\,[v_{k+1}, v_{k+2}, \ldots, v_{n-1}]'$    (14)
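How the split point k between the primary and secondary sets is chosen is not fixed by the text. One plausible rule, assumed here purely for illustration, is to keep the smallest number of leading components whose eigenvalues reach a target share of the total variance; the function name, the 90 percent default and the singular values are all hypothetical.

```python
def choose_k(singular_values, target=0.90):
    """Smallest k whose leading eigenvalues reach `target` of the total variance.

    singular_values must be sorted in decreasing order, as in Expression (8).
    """
    eigenvalues = [w * w for w in singular_values]  # lambda_j = w_j squared
    total = sum(eigenvalues)
    running = 0.0
    for k, lam in enumerate(eigenvalues, start=1):
        running += lam
        if running / total >= target:
            return k
    return len(eigenvalues)


# Hypothetical singular values: the first two components dominate.
print(choose_k([3.0, 2.0, 1.0, 0.1]))
print(choose_k([3.0, 2.0, 1.0, 0.1], target=0.99))
```

With these values the eigenvalues are 9, 4, 1 and 0.01, so two components already cover about 93 percent of the variance and three cover more than 99 percent.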

The set of primary principal component vectors represented by $u_1$ through $u_k$ defines the orthonormal basis of what is referred to as inside model space of the mean centered spectral matrix $\bar{X}$. The set of secondary principal components represented by $u_{k+1}$ through $u_{n-1}$ defines the orthonormal basis of what is defined as outside model space. Each spectrum represented by the column vectors of $\bar{X}$ can be decomposed into the sum of two vectors. The first vector projects in the inside model space and represents the most significant correlated behavior of the vectors in principal component model space, since $u_1$ through $u_k$ cover most of the variance in $\bar{X}$. The second vector projects in outside model space, represents the residual variations in the principal component model space, and is considered to be random error. The second vectors projecting in outside model space are omitted from the analysis because they represent noise and other error terms. The primary principal component vectors $u_1$ through $u_k$ provide an orthonormal basis for the k-dimensional inside principal component space, which describes most of the variation in the original mean centered spectral data matrix $\bar{X}$. The corresponding score matrix $S_I = [s_1, s_2, \ldots, s_n]$ can be computed by

$S_I = L_I' \bar{X}$    (15)

in which $L_I'$ is the transpose of $L_I$.

The vectors $s_1$ through $s_n$ are each column vectors having k dimensions and are the principal component scores for the corresponding vectors $x_1 - \bar{x}$ through $x_n - \bar{x}$ of the mean centered spectral data matrix in principal component inside model space. In other words, when one of the vectors $x_i - \bar{x}$ is projected in principal component inside model space, the values of its coordinates are represented by the column vector $s_i$. The principal component score vector for each mean centered spectrum $(x_i - \bar{x})$ can be directly calculated by

$s_i = L_I'(x_i - \bar{x})$    (16)

The above equation gives a linear transformation in which a spectrum represented by a vector having l dimensions is transformed to a score vector in the principal component inside model space. In this manner, each of the column vectors in the mean centered matrix $\bar{X}$ is transformed into a vector in principal component inside model space having substantially fewer dimensions than the original vector.
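Equation (16) amounts to subtracting the global mean and taking one dot product per primary principal component. In the sketch below the loadings are stored as k rows of length l (that is, as the transpose of $L_I$); the function name and the toy numbers are assumptions for the example.

```python
def project(spectrum, global_mean, loadings_t):
    """Score vector s = L_I' (x - x_bar), as in Equation (16).

    loadings_t holds the k primary principal components as rows of length l.
    """
    centered = [x - m for x, m in zip(spectrum, global_mean)]
    return [sum(u_j * c_j for u_j, c_j in zip(u, centered)) for u in loadings_t]


# Hypothetical case: l = 3 wavelength points reduced to k = 2 scores.
loadings_t = [[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0]]
global_mean = [1.0, 1.0, 1.0]
print(project([3.0, 0.0, 5.0], global_mean, loadings_t))
```

The same function serves for an unknown spectrum, since the unknown is projected with the mean and loadings derived from the library.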

Principal component analysis is a known statistical data compression technique and is described in the text entitled Factor Analysis by Malinowski. By the process described thus far, the entire or global set of mean vectors representing the known products is processed by principal component analysis to yield a set of score vectors projected into principal component inside model space. Since each score vector represents a known product, they are referred to as known product score vectors. This process is represented in the flow chart of Figure 2 as step 31, in which the global set of spectra of mean vectors representing known products is transformed into a mean centered matrix $\bar{X}$, and step 33, in which the mean centered matrix is subjected to principal component analysis to yield the known product score vectors for the matrix determined in step 31. These vectors, which each correspond to a different one of the products in the library of products, are then separated into clusters in step 35. To determine which known product score vectors will be in which cluster, the Euclidean distance between the end point of each known product score vector and each other known product score vector is measured and a minimal spanning tree is constructed linking the vectors. The minimal spanning tree is then used to separate the vectors into clusters by dividing the minimal spanning tree at those links which have the greatest lengths. More specifically, the average length of the minimal spanning tree links is determined and the standard deviation in the length of the minimal spanning tree links is also determined. The minimal spanning tree is then separated into clusters of vectors by defining a cluster separation at each minimal spanning tree link which is greater in length than the average minimal spanning tree link plus a scaling factor times the standard deviation in the lengths of the minimal spanning tree links, expressed as follows:

$a + k\sigma$

wherein a is the average length of a minimal spanning tree link, k is the scaling factor and $\sigma$ is the standard deviation in the length of the spanning tree links.
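The cutting rule can be tried out on toy data with the sketch below. It builds the minimal spanning tree with Prim's algorithm, which is one standard choice and not necessarily the one used in the described system, and it uses the population standard deviation because the text does not say which form applies. All names and the two-dimensional points are hypothetical.

```python
import math
import statistics


def mst_links(points):
    """Prim's algorithm on the complete Euclidean graph; returns (length, i, j) links."""
    # best[j] = (distance from j to the tree so far, nearest tree vertex)
    best = {j: (math.dist(points[0], points[j]), 0) for j in range(1, len(points))}
    links = []
    while best:
        j = min(best, key=lambda v: best[v][0])
        d, i = best.pop(j)
        links.append((d, i, j))
        for v in best:
            dv = math.dist(points[j], points[v])
            if dv < best[v][0]:
                best[v] = (dv, j)
    return links


def split_clusters(points, k=1.0):
    """Cut every MST link longer than a + k*sigma; return clusters of point indices."""
    if len(points) < 2:
        return [list(range(len(points)))]
    links = mst_links(points)
    lengths = [d for d, _, _ in links]
    threshold = statistics.mean(lengths) + k * statistics.pstdev(lengths)
    parent = list(range(len(points)))  # union-find over the links that are kept

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    for d, i, j in links:
        if d <= threshold:
            parent[find(i)] = find(j)
    clusters = {}
    for v in range(len(points)):
        clusters.setdefault(find(v), []).append(v)
    return sorted(clusters.values())


# Two well separated groups of hypothetical score vectors.
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (10.0, 10.0), (11.0, 10.0)]
print(split_clusters(points, k=1.0))
```

Here three links have length 1 and one link is about 13.5 long, so only the long link exceeds the threshold and the points fall into two clusters. A larger scaling factor cuts fewer links and gives coarser clusters.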

This process will thus divide the product score vectors from the entire library of products into a set of clusters. Typically, most of the clusters will contain more than the permitted maximum number of product score vectors and these clusters are further divided into clusters by repeating the process limited to the products of each cluster. Thus in step 37, if the permitted maximum number of vectors in an individual cluster is 10, it is determined whether any of the clusters have more than 10 vectors. If so, the process proceeds into step 39, in which mean centered matrices are determined for the large clusters containing more than 10 vectors, a separate matrix being determined for each large cluster from the corresponding mean vectors representing the corresponding known products. The process then loops back through step 33 to perform the principal component analysis on the mean centered matrices determined from the large clusters. This process determines a set of principal components for each large cluster to be subdivided and then determines a principal component inside model space for each large cluster. Following this determination the process proceeds again through step 35 to divide the vectors of each large cluster into subclusters in the same manner as described above. In this subdivision process, the vectors being subdivided will be closer together than in the first iteration through step 35, so the average length of the minimal spanning tree links between product vectors will be smaller and the criteria for separating the vectors into subclusters will therefore be finer. The principal component analysis as described above is thus carried out to further divide the large clusters containing more than the permitted maximum number of product vectors into additional clusters called subclusters or child clusters. In a similar manner the child clusters are then further divided into grandchild clusters, great grandchild clusters, and great, great grandchild clusters, if necessary, until each cluster or subcluster contains no more than ten products.

An example of this clustering technique is illustrated in Figure 3. In this figure the library of products is assumed to contain 150 known products and, in the example illustrated, the first division of these products into clusters divides them into a cluster of two, a cluster of 25, a cluster of 38, a cluster of 33, a cluster of nine, and a cluster of 43 products. In this example, the maximum number of vectors permitted to be in a cluster that is not to be further subdivided is selected to be ten. The clusters of 25, 38, 33 and 43 need to be divided into further subclusters because they contain more than ten products, and accordingly the cluster containing 25 products is divided into subclusters or child clusters which in this example are represented as containing three, seven, and 15 products, respectively. The subcluster containing 15 products is then further divided into subclusters or grandchild clusters containing five and ten products, respectively. The highest level cluster containing 38 products in the example is shown as being divided into subclusters or child clusters containing 15, 17 and seven products, respectively. The child cluster containing 15 products is then further divided into clusters containing ten and five products, respectively, at the third level, and the child cluster containing 17 products is further divided into grandchild clusters containing 12 and five products, and the grandchild cluster of 12 products is further divided into great grandchild clusters containing four and eight products at the fourth level. The remaining initial clusters containing 33 and 43 products are similarly further divided. In the illustration, the cluster division goes to four levels, or great grandchild clusters, but theoretically there is no limit to how many levels the clusters may be further subdivided.

When all of the large clusters have been subdivided into subclusters the process proceeds from step 37 into step
41. In this step a hypersphere is constructed around each vector in each cluster, including the parent clusters as well
as the subclusters. The hypersphere will have the same number of dimensions as the principal component inside model
space in which it is constructed and therefore will be k-dimensional. The radius of each hypersphere is determined from
the Euclidean norm of the standard deviation spectrum of the training set of spectra for the corresponding product
multiplied by a scalar factor which may be selected by the user. To calculate the Euclidean norm of the standard
deviation spectrum of a training set, the standard deviation is calculated from the training set values at each wavelength
measurement point distributed throughout the spectrum and then the Euclidean norm is calculated from these values
by taking the square root of the sum of the squares as follows:



$$ r = (\text{scalar factor}) \times \sqrt{\sum_{i} x_{\sigma i}^{2}} $$

in which r is the radius and x_σi is the standard deviation of the training set absorbances at measurement point i. Typically
the scalar factor will be selected to be three so that 99% of vectors derived from products corresponding to the product
of the training set would be represented by a vector falling within the hypersphere.
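
The radius calculation can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: the function name `hypersphere_radius` and the parameter `scale` are hypothetical, and the use of the sample standard deviation (divisor n - 1) is an assumption, since the text does not say which estimator is meant.

```python
import math
import statistics


def hypersphere_radius(training_spectra, scale=3.0):
    """Radius = scale * Euclidean norm of the per-wavelength standard deviations.

    training_spectra: list of spectra, each a list of absorbances taken at the
    same wavelength measurement points. The sample standard deviation is an
    assumption; the default scale of three follows the typical value in the text.
    """
    # Transpose so that each tuple holds the absorbances at one measurement point.
    per_wavelength = zip(*training_spectra)
    std_spectrum = [statistics.stdev(values) for values in per_wavelength]
    # Square root of the sum of the squares, multiplied by the scalar factor.
    return scale * math.sqrt(sum(s * s for s in std_spectrum))
```

For a training set of three two-point spectra whose standard deviations are 2 and 0, the norm is 2 and the radius with a scalar factor of three is 6.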

The above described process is illustrated in Fig. 4. In this illustration the product mean spectrum x̄_i yields a mean
centered spectrum x̄_i - x̄, which is projected as a product score vector s_i into principal component inside model space
by the operation L'_I(x̄_i - x̄). A hypersphere having a radius equal to the Euclidean norm of the standard deviation
spectrum of the training set of absorbance spectra for the product multiplied by the scalar factor is constructed around
the product score vector s_i.

Following the generation of the hyperspheres, envelopes are constructed in step 43 around each cluster of products
and each subcluster of products. The envelopes are in the form of multidimensional rectangular boxes called
hyper-boxes having the same number of dimensions as the principal component inside model space (k dimensions),
and will be constructed so as to encompass each of the vector end points and their surrounding hyperspheres.
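
One way to build such an envelope is sketched below in Python. It assumes, as a simplification not stated in the description, that the hyper-box is aligned with the principal component axes, so each bound is the extreme value of a score coordinate minus or plus the hypersphere radius. The names `hyperbox` and `in_hyperbox` are hypothetical.

```python
def hyperbox(centers, radii):
    """Smallest axis-aligned box enclosing every hypersphere of a cluster.

    centers: score vectors (k coordinates each); radii: one radius per vector.
    Returns (lower, upper), each a list of k bounds.
    """
    k = len(centers[0])
    lower = [min(c[d] - r for c, r in zip(centers, radii)) for d in range(k)]
    upper = [max(c[d] + r for c, r in zip(centers, radii)) for d in range(k)]
    return lower, upper


def in_hyperbox(point, box):
    """True if the point lies inside or on the boundary of the box."""
    lower, upper = box
    return all(lo <= x <= hi for lo, x, hi in zip(lower, point, upper))
```

Testing whether a score vector falls within an envelope then costs only 2k comparisons, which is what makes the envelope test suitable for a quick search.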

The process of identifying an unknown product is illustrated in Figure 5. In this process, first a quick search of the
library of known spectra is conducted using the clusters defined as described above. In the quick search, the absorbance
spectrum from the unknown product is obtained and a score vector of the resulting spectrum is projected into the
principal component inside model space constructed for the global set of clusters in the first step of dividing the clusters.
To carry out this projection, first the global mean vector x̄ has to be subtracted from the vector representing the unknown
product, then the resultant mean corrected vector representing the unknown product is multiplied by the projection
operator L'_I, wherein L_I and L'_I have been calculated as described above. The projection operator L'_I projects the
unknown sample spectrum as a score vector in the principal component inside space spanned by the global principal
component vectors. After the projection of the unknown sample into the principal component inside space of the global
set of mean vectors, the process determines whether or not the resulting projected product vector representing the
unknown product falls within any of the hyper-box envelopes surrounding the clusters. If the vector representing the
unknown product falls within the envelope of a cluster that is not further subdivided, the identification process
then proceeds with the exhaustive search comparing the unknown product with each of the products in the cluster,
as will be described below.

If the unknown product vector falls in a cluster which is further divided into subclusters, then the above-described
process must be repeated for the subclusters to determine which of the subclusters a vector representing the
unknown product falls into. As explained above, the principal component inside model space for dividing each cluster
into subclusters is different from the global principal component inside model space determined for the global set of
mean vectors and also different from that determined for other subclusters. Accordingly, the process of calculating the
projection of the unknown product into principal component inside model space must be carried out separately for each
subcluster. Thus, in carrying out the projection for a given subcluster, the mean vector of the mean centered matrix for the
parent cluster will be subtracted from the unknown product vector to determine the mean corrected vector representing
the unknown product. Then this mean corrected spectrum vector is multiplied by the projection operator L'_I for the
relevant principal component inside model space to project the mean corrected unknown product vector into this principal
component inside model space. The relevant principal component inside model space will be that hyperspace used to
divide the parent cluster into subclusters. This process is repeated until the vector representing the product is found to
fall in a subcluster which is not further divided or otherwise found to fall outside of any subcluster.
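
The quick search can be sketched in Python as a descent through the cluster tree. This is an illustration under stated assumptions: the `Node` structure and the function names are hypothetical, each divided node is assumed to store the mean vector and principal component vectors used to divide it, and when envelopes overlap the sketch simply takes the first envelope containing the vector, a case the description does not address.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    products: list                 # names of known products in this (sub)cluster
    box: tuple = None              # (lower, upper) envelope in the PARENT's score space
    mean: list = None              # mean vector subtracted before projecting (divided nodes)
    loadings: list = None          # k rows, one principal component vector per row
    children: list = field(default_factory=list)


def project(spectrum, mean, loadings):
    """Score vector: subtract the mean, then take the dot product of the
    mean corrected spectrum with each principal component vector."""
    centered = [x - m for x, m in zip(spectrum, mean)]
    return [sum(l * c for l, c in zip(row, centered)) for row in loadings]


def inside(point, box):
    lower, upper = box
    return all(lo <= x <= hi for lo, x, hi in zip(lower, point, upper))


def quick_search(root, spectrum):
    """Descend the tree. Returns the products of the undivided cluster reached,
    or None if the unknown falls outside every envelope at some level."""
    node = root
    while node.children:
        # Each level uses the model space that divided the current node.
        scores = project(spectrum, node.mean, node.loadings)
        hits = [child for child in node.children if inside(scores, child.box)]
        if not hits:
            return None
        node = hits[0]  # assumption: first envelope containing the vector
    return node.products
```

The products returned by `quick_search` are the candidates passed to the exhaustive search; a result of `None` corresponds to an unknown that falls outside of any cluster or subcluster.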

When the cluster identification step of the process has been completed and the unknown product is found to fall
into an identified cluster or subcluster which is not further divided, an exhaustive search is conducted on the library
spectra of those known products of that identified cluster or subcluster to determine which product the unknown product
corresponds with. This determination may be carried out by several different methods, one of which is the method
described in the above-mentioned patent to Begley wherein the angle between the multidimensional vector represented
by the product spectra is compared with the angles of the mean vectors representing the products in the same cluster
and if the cosine of the angle between the vectors is less than a certain selected minimum, the product is deemed to
be the same product.

A second method is to compare the spectrum of the unknown product point by point with a spectrum band determined
from each training set of spectra representing one of the products in the cluster. The spectrum band is determined
by calculating the standard deviation of the training set at each wavelength measurement to determine an upper
and lower limit for the band at each wavelength. The upper and lower limits are x̄ ± p x_σ, in which x_σ represents the
standard deviation at each wavelength and p represents a scalar quantity selected by the user. An example of such a band
is illustrated in Figure 4. The product is determined to be the same as a specific product in the cluster if every point in
its absorbance spectrum falls within the band determined from the training set spectra extending over the near infrared
range of measurement.
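
The band comparison can be sketched in Python as follows. This is a minimal illustration with hypothetical names (`spectrum_band`, `within_band`); centering the band on the training set mean at each wavelength and using the sample standard deviation are assumptions made for the sketch.

```python
import statistics


def spectrum_band(training_spectra, p):
    """Per-wavelength (lower, upper) limits: mean -/+ p * standard deviation."""
    band = []
    for values in zip(*training_spectra):        # absorbances at one wavelength
        mean = statistics.mean(values)
        spread = p * statistics.stdev(values)    # sample standard deviation (assumption)
        band.append((mean - spread, mean + spread))
    return band


def within_band(spectrum, band):
    """True only if every point of the spectrum lies inside the band."""
    return all(lo <= x <= hi for x, (lo, hi) in zip(spectrum, band))
```

Because a single point outside the band rules the product out, `within_band` can stop at the first failing wavelength, which `all` does implicitly.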

In accordance with a third method of performing an exhaustive search of the library spectra of the products
corresponding to the cluster identified in the quick search, principal component analysis is applied to each training set of
spectra for each product corresponding to the identified cluster. This determination yields a model of principal
component inside space for the training set of each product of the cluster. To match the unknown sample with a known product,
the score vector from the unknown sample in the local principal component inside model space is determined and the
Mahalanobis distance of the score vector and the mean of the score vectors of the training set is calculated. If the
Mahalanobis distance of the score vector of the unknown sample is less than the threshold value selected by the user, the
unknown sample is determined to match the known sample of the local principal component inside model space. The
Mahalanobis distance between the score vector and the mean of the training set score vectors of a known product can
be determined from the sum of the squares of the coordinates of the score vector of the unknown product weighted by
the associated eigenvalues as follows:

$$ d_M^2(s) = (n-1) \sum_{j=1}^{k} \frac{s_j^2}{\lambda_j} $$

in which d_M(s) represents the Mahalanobis distance, n is the number of samples in the training set, s_j is the j-th
coordinate of the score vector of the unknown sample, and λ_j is the j-th eigenvalue associated with the local principal
component inside model space derived from the training set of the known product. Geometrically, all score vectors at a
Mahalanobis distance smaller than a constant will fall within an ellipsoid boundary in local principal component inside
model space centered on the mean of the score vectors of the training set.
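
The distance calculation can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: the function names are hypothetical, the formula is read as (n - 1) times the sum of each squared score coordinate divided by its eigenvalue, and it is assumed that the user threshold applies to the distance itself rather than to its square.

```python
def mahalanobis_squared(scores, eigenvalues, n):
    """(n - 1) * sum(s_j^2 / lambda_j) over the k retained components."""
    return (n - 1) * sum(s * s / lam for s, lam in zip(scores, eigenvalues))


def matches(scores, eigenvalues, n, threshold):
    """Compare the distance (square root of the sum) with the user threshold."""
    return mahalanobis_squared(scores, eigenvalues, n) ** 0.5 < threshold
```

Dividing by the eigenvalues stretches the acceptance region along directions in which the training set varies strongly, which gives the ellipsoid boundary mentioned above.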

Each of the above-described methods of product identification is satisfactory, with the first described method of the
Begley patent taking the shortest amount of time, the second method taking more time and the third method taking still
more time to complete the matching of the unknown product with the known product. Accordingly, in accordance with
the preferred embodiment the three methods are carried out in sequence with the method of the Begley patent being
used to rule out a first group of the products of the cluster, leaving a remaining smaller group to be considered and then
the second method being used to rule out a subgroup of this second smaller group, leaving a third still smaller group to
be considered, and then using the Mahalanobis distance method to perform the final matching of the unknown product
with a known product in the product library.
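
The sequencing of the three methods can be sketched in Python as a cascade of filters. This is an illustration only: `cascade` is a hypothetical name, and each test is assumed to be a function that returns True when a candidate product cannot be ruled out by that method.

```python
def cascade(candidates, tests):
    """Apply the tests in order, fastest first. Each test keeps only the
    candidates it cannot rule out, so slower tests see fewer products."""
    remaining = list(candidates)
    for test in tests:
        remaining = [c for c in remaining if test(c)]
        if not remaining:
            break  # every candidate has been ruled out; later tests are skipped
    return remaining
```

Ordering the tests from fastest to slowest means the most time-consuming comparison is applied only to the few products that survive the earlier ones.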

As described above, the process employs principal component analysis to reduce the dimensions of the vectors
representing both the known products and the unknown products for comparison of the vectors with the clusters of
hyperspheres. It will be apparent that the process could be performed without using principal component analysis by
simply representing each product training set by a mean product vector extending in hyperspace
having dimensions equal to the number of spectral measurements throughout the infrared range, and clustering the
products and comparing the unknown products with the clusters of products in essentially the same manner as
described above. In addition, other methods instead of principal component analysis may be used to compress the
dimensions of the vectors representing each product.

Because the above-described clustering technique is used and also because it is used in combination with the
principal component analysis to reduce the number of dimensions of the vectors, the time required to match an unknown
product with a product in the library is reduced to a small fraction of the time required by the prior art methods.
Accordingly, the system may be used on a loading dock to quickly determine the contents of a received product before a newly
delivered product is accepted.

The above description is of a preferred embodiment of the invention and modifications may be made thereto without
departing from the spirit and scope of the invention which is defined in the appended claims.

Claims 

1. A method of matching an unknown product with one of a library of known products comprising the following steps:

step (1): measuring a near infrared absorbance spectrum for each of said known products,
step (2): generating known product vectors extending into hyperspace representing the absorbance spectra
determined for each of said known products,
step (3): dividing said known product vectors into clusters of vectors extending into hyperspace wherein the
vectors of each cluster are closer to each other in hyperspace than the vectors outside of such cluster,
step (4): dividing at least some of said clusters of vectors into subclusters of vectors extending into hyperspace,
step (5): repeating said step (4) on at least some of said subclusters until all of said subclusters have fewer
than a predetermined number of vectors,
step (6): surrounding each of said clusters and subclusters with an envelope defined in the corresponding
hyperspace,
step (7): measuring the absorption spectrum of said unknown product,
step (8): determining in which of said envelopes surrounding said clusters divided in step (3) a vector,
representing said unknown product and extending into the hyperspace of said clusters, falls,
step (9): if the vector representing said unknown product falls into an envelope surrounding a cluster which is
divided into subclusters, then determining in which envelope surrounding a subcluster a vector representing
said unknown product and extending into the hyperspace of such subcluster, falls,
step (10): repeating the step (9) on further divided subclusters until a vector representing said unknown
product is determined to fall into an envelope surrounding a subcluster which is not further divided,
step (11): then determining which known product represented by a vector within said last-named envelope said
unknown product matches.

2. A method as recited in claim 1 wherein said step (2) includes subjecting said absorbance spectra determined of
said known products to principal component analysis to determine a known product score vector representing
each known product projected into principal component inside model space.

3. A method as recited in claim 2 wherein said step (4) includes subjecting the absorbance spectra of the known
product vectors of each cluster and subcluster to principal component analysis to determine a product score vector for
each known product of the cluster or subcluster extending into principal component inside model space determined
for the known product spectra of such cluster or subcluster.

4. A method as recited in claim 2 further comprising surrounding each known product score vector with a
hypersphere, said step (6) including surrounding the hyperspheres of the corresponding clusters with each of said
envelopes.

5. A method as recited in claim 2 wherein each of said envelopes comprises a hyperbox of orthogonal dimensions. 

6. A method as recited in claim 1 further comprising surrounding each of said known product vectors with a
hypersphere, said step (6) including surrounding the hyperspheres of the corresponding clusters with said envelopes.

7. A method as recited in claim 1 wherein each of said envelopes comprises a hyperbox having orthogonal
dimensions.

8. A method of matching a product with one of a library of known products comprising 

measuring the absorbance spectra of said known products,
subjecting said absorbance spectra to principal component analysis to determine known product score vectors
projecting in principal component inside model space,
surrounding said known product score vectors with hyperspheres,
measuring the absorbance spectrum of said unknown product,
determining from the spectrum of said unknown product an unknown product score vector projecting in
principal component inside model space, and
determining in which of said hyperspheres said unknown product score vector falls.

9. A method as recited in claim 8 wherein said absorbance spectra of said known products include a training set of
a plurality of spectra for each of said known products and wherein said method includes obtaining an average
absorbance spectrum of each training set, determining from said average spectra and the corresponding training set
a standard deviation for each known product, the hyperspheres each having a radius equal to a selected multiple
of said standard deviation.
